Abstract.

Exact solutions for the learning problem of autoassociative networks with binary 
couplings are determined by a new method: The use of a branch-and-bound algorithm 
leads to a substantial saving of computing time compared to complete enumeration. 
As a result, fully connected networks with up to 40 neurons could be investigated. 
The network capacity is found to be close to 0.83.

The training of neural networks with binary couplings is believed to belong to the class
of NP-complete problems, i.e. the average computing time required to find a solution
scales exponentially with the number of couplings to be determined. This exponential
scaling is due to the discrete structure of the space of couplings and is obvious in the
case of complete enumeration. However, theoretical and numerical studies
showed that it holds as well for heuristic approaches (e.g. simulated annealing).
Training by complete enumeration has been carried out for small networks with up to
25 neurons. Heuristic algorithms have been used for networks with up to a
thousand neurons. Still, the main disadvantage of heuristic algorithms is the
uncertainty about the existence of solutions not found by the algorithm.

Our aim has been to develop an exact algorithm guaranteed to find all possible
solutions in considerably less computing time than complete enumeration.
Gardner showed that the space of interactions in neural network models can be treated
in a way similar to the phase space of spin glass models. Accordingly, it should be
possible to use the branch-and-bound method, already successfully applied to the
search for ground states of an Ising spin glass model [10], for the training of neural
networks with binary couplings.

Consider an autoassociative network built of $N$ two-state neurons taking the values
±1, indexed by $i = 1, \dots, N$, and fully connected by binary synaptic couplings that
can take on the values $J_{ij} = \pm 1$. The self-couplings $J_{ii}$ are set to zero. The
task of the network is to store a set of patterns $\xi^\mu$ ($\mu = 1, \dots, p$) with
elements $\xi_i^\mu = \pm 1$. A training procedure determines couplings that make these
patterns attractors of the discrete network dynamics.

The capacity of the network specifies the number of different patterns that can be
stored simultaneously. It is normally expressed as a critical load $\alpha_c = p_c/N$.

For good retrieval one is interested in large basins of attraction. As discussed in
the literature, these correspond to large values of the pattern stability.

The maximally stable rule therefore formulates the learning problem as an
optimization task: for a given set of patterns, one has to determine an optimal set of
couplings that maximises the network stability

$$\kappa = \min_{i,\mu} \left( \Delta_i^\mu \right). \tag{3}$$

As long as there is no symmetry constraint on the matrix of couplings, the optimization
task separates into the training of $N$ simple perceptrons with $N - 1$ input neurons,
corresponding to the individual rows of the matrix with the self-coupling excluded.
The network stability $\kappa$ emerges as the minimum of the "perceptron stabilities" $\kappa_i$.
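
As an illustration of this row-wise decomposition, the following minimal sketch computes the perceptron stabilities and their minimum. It assumes the common definition of the pattern stability, $\Delta_i^\mu = \xi_i^\mu \sum_{j \neq i} J_{ij} \xi_j^\mu / \sqrt{N}$; the $1/\sqrt{N}$ normalization and the function names are assumptions made for this example only.

```python
import math

def perceptron_stabilities(J, patterns):
    """Return kappa_i = min over patterns of the stability of neuron i.

    J is an N x N list of couplings (the diagonal is ignored),
    patterns is a list of length-N lists.
    """
    n = len(J)
    norm = math.sqrt(n)  # assumed normalization, see lead-in
    kappas = []
    for i in range(n):
        per_pattern = []
        for xi in patterns:
            # Local field on neuron i, self-coupling excluded.
            field = sum(J[i][j] * xi[j] for j in range(n) if j != i)
            per_pattern.append(xi[i] * field / norm)
        kappas.append(min(per_pattern))
    return kappas

def network_stability(J, patterns):
    """Network stability: the minimum of the perceptron stabilities."""
    return min(perceptron_stabilities(J, patterns))

# Example: a single pattern stored with J_ij = xi_i * xi_j (i != j).
# Every term of the sum is then positive, so each stability is (N - 1) / sqrt(N).
xi = [1, -1, 1, 1, -1]
J = [[0 if i == j else xi[i] * xi[j] for j in range(5)] for i in range(5)]
kappa = network_stability(J, [xi])
```

Because each row enters only its own $\kappa_i$, the rows can be optimized independently, which is what makes a per-row search possible.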

The new learning algorithm was developed using the branch-and-bound method,
a standard tool of combinatorial optimization theory [13]. To find a row of the matrix
of couplings with maximal stability $\kappa_i$, complete enumeration would check all $2^{N-1}$
possible configurations for optimal ones. Branch-and-bound starts with a division into
a hierarchy of subproblems: each single coupling is tested with both possible values,
taking into account the state of the couplings previously fixed on a trial basis,
thus forming a binary tree of "incomplete" configurations. Only the final level of the
tree contains the "complete" solutions. This division is the 'branching' part of
the algorithm. Standing alone, it would double the necessary computing time. Here
the 'bounding' (and subsequently cutting) part comes into action: for each node of the
binary tree, an upper bound for the best possible solution of the remaining subproblem
is evaluated. The starting point is an ideal stability of $N - 1$, which is obtained if one
takes all terms in the sum (2) to be positive. (Generally, the maximal stability lies
below this ideal stability, which can only be achieved if there is just one pattern to
store.) When a coupling $J_{ij}$ is tested, this bound is corrected, taking into account
the already fixed part of the configuration. If it falls below a pre-set value, e.g. the
stability attained by the clipped Hebb rule, the binary tree is "cut" at this
node, i.e. the subtree of this node does not need to be considered. As a result, only a
small percentage of the nodes has to be checked. For $N = 25$ we found that only $10^{-4}$
to 8 percent of the nodes were evaluated, depending on, e.g., the number of patterns
to store.
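
The branching and bounding steps described above can be sketched for a single row as follows. This is a minimal illustration under stated assumptions, not the implementation used here: stabilities are left unnormalized (integer sums), the input `c[mu][j]` is assumed to hold the products $\xi_i^\mu \xi_j^\mu$ for the row $i$ under consideration, ties in the clipped Hebb rule are resolved to +1, and the cutting threshold is raised whenever a better complete configuration is found. All function names are hypothetical.

```python
from itertools import product

def clipped_hebb(c):
    """Clipped Hebb couplings: sign of the column sums of c (ties give +1)."""
    n = len(c[0])
    return [1 if sum(row[j] for row in c) >= 0 else -1 for j in range(n)]

def stability(J, c):
    """Unnormalized stability: min over patterns of sum_j J_j * c[mu][j]."""
    return min(sum(Jj * cj for Jj, cj in zip(J, row)) for row in c)

def branch_and_bound(c):
    """Return (best stability, all optimal coupling vectors, nodes visited)."""
    n = len(c[0])
    best = stability(clipped_hebb(c), c)  # pre-set value from the clipped Hebb rule
    solutions, nodes = [], 0
    J = [0] * n
    partial = [0] * len(c)  # partial sums over the couplings fixed so far

    def visit(depth):
        nonlocal best, solutions, nodes
        nodes += 1
        # Upper bound: every still-free coupling contributes +1 to every pattern.
        bound = min(partial) + (n - depth)
        if bound < best:
            return  # cut: the subtree cannot reach the current threshold
        if depth == n:  # complete configuration, bound is its exact stability
            if bound > best:
                best, solutions = bound, []
            solutions.append(J.copy())
            return
        for value in (1, -1):  # branching: test both values of this coupling
            J[depth] = value
            for mu, row in enumerate(c):
                partial[mu] += value * row[depth]
            visit(depth + 1)
            for mu, row in enumerate(c):
                partial[mu] -= value * row[depth]

    visit(0)
    return best, solutions, nodes

# Small hand-made instance (3 patterns, 6 inputs), checked against enumeration.
c = [
    [1, 1, -1, 1, -1, 1],
    [1, -1, -1, 1, 1, -1],
    [-1, 1, -1, -1, 1, 1],
]
best, solutions, nodes = branch_and_bound(c)
brute = max(stability(cfg, c) for cfg in product((1, -1), repeat=len(c[0])))
```

At the root the bound equals the number of inputs, which corresponds to the ideal stability $N - 1$. The counter `nodes` can be compared with the $2^{N-1}$ configurations of complete enumeration to estimate the saving.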

Assuming that the evaluation of a node of the binary tree is approximately as
time-consuming as checking one possible configuration during complete enumeration,
we compared the two methods. As predicted by theory, the algorithm still
scales exponentially. However, if we set the load $\alpha = p/N$
to 0.5 and look for one solution with positive stability, the algorithm no longer scales
with $2^{N}$ but with $2^{0.46N}$. In the (worst) case of determining all optimal solutions
at $\alpha = 1$, the scaling is $3 \times 2^{0.8N}$. (This means that the algorithm still optimizes
a 30-neuron network in approximately the computing time needed for the complete
enumeration of a 25-neuron network.)
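
As a rough check of the parenthetical remark, the sketch below converts a node count of the form prefactor × 2^(exponent × N) into the size of a network whose complete enumeration would cost the same. It assumes a worst-case count of 3 × 2^(0.8 N) and takes 2^N' as the cost of enumerating a network of size N'; the function name is hypothetical.

```python
import math

def equivalent_enumeration_size(n, prefactor=3.0, exponent=0.8):
    """Return N' such that 2**N' equals prefactor * 2**(exponent * n)."""
    return math.log2(prefactor) + exponent * n

size_30 = equivalent_enumeration_size(30)
print(round(size_30, 1))  # about 25.6
```

Under these assumptions a 30-neuron network costs about as much as enumerating a network of 25 to 26 neurons.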

We used the branch-and-bound algorithm to determine the capacity of the network
storing random uncorrelated patterns. Only one row of the coupling matrix was
considered, assuming the stability value to be self-averaging in the thermodynamic
limit.

The procedure resembles one used previously. For a given value of $N$, the stability
$\kappa_N(\alpha)$ is determined for an increasing number of patterns until its value becomes
negative, signifying that it is no longer possible to store all patterns. Then the capacity
$\alpha_c(N)$ is determined by a linear interpolation between the last positive $\kappa_N(\alpha_+)$ and the
negative $\kappa_N(\alpha_-)$. If the patterns are binary-valued, $\xi_i^\mu = \pm 1$, $\kappa_N(\alpha)$ takes on discrete
values with a spacing of $2/\sqrt{N}$. For $N$ odd, this discreteness results in two values of
$\alpha_c(N)$, corresponding to the first and last occurrence of $\kappa_N(\alpha) = 0$. This procedure
was carried out for networks with $N = 4, \dots, 40$.

Figure 1. Network capacity in the case of random uncorrelated ±1 patterns

To reduce finite-size (discretization and parity) effects, we also used continuously
distributed patterns. We considered a normalized Gaussian distribution as well as
patterns with elements evenly distributed in the interval $-1 < \xi_i^\mu < +1$ (box
constraint) to examine the influence of the pattern distribution. Figures 1 and 2 show
the dependence of this capacity on the network size as well as on the nature of the
patterns. The error bars correspond to twice the mean deviation of the average value
(statistical error). Sample size varied between 10 000 for small systems and 100 for
$N = 40$. The ±1 patterns exhibit a strong parity effect, which should however vanish in
the thermodynamic limit. In figure 2, the values for the Gaussian patterns show a
periodicity which is probably a result of the linear interpolation, as the period of six
corresponds to a stability value $\kappa_N(\alpha)$ passing the zero line. (Remember that the
critical capacity is approximately 5/6 and $\alpha$ is restricted to rationals $p/N$.)
Quadratic fits are given as a guide to the eye (cf. B).
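
The interpolation step of the procedure described above can be sketched as follows. The function name and the numbers in the example are illustrative assumptions, not data from this work. When the stability is exactly zero for some loads (possible for discrete patterns), the sketch returns the first and the last such load, otherwise a single interpolated value.

```python
def critical_load(p_values, kappas, n):
    """Estimate alpha_c(N) from stabilities measured at increasing p.

    Returns a pair (alpha_first, alpha_last). Both entries coincide unless
    kappa is exactly zero for one or more loads.
    """
    alphas = [p / n for p in p_values]
    zeros = [a for a, k in zip(alphas, kappas) if k == 0]
    if zeros:
        return zeros[0], zeros[-1]
    for i in range(len(kappas) - 1):
        k_pos, k_neg = kappas[i], kappas[i + 1]
        if k_pos > 0 > k_neg:
            # Linear interpolation between the last positive and the negative value.
            step = alphas[i + 1] - alphas[i]
            a = alphas[i] + step * k_pos / (k_pos - k_neg)
            return a, a
    raise ValueError("kappa does not change sign in the given range")

# Illustrative numbers only: N = 10, stabilities for p = 7, 8, 9.
first, last = critical_load([7, 8, 9], [0.6, 0.2, -0.2], 10)
# A case with exact zeros, as can occur for discrete patterns.
z_first, z_last = critical_load([3, 4, 5, 6], [2, 0, 0, -2], 5)
```

Since the load only takes the values $p/N$, the interpolated zero depends on where the sign change falls between two neighbouring loads.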



Figure 2. Network capacity in the case of continuously distributed patterns (horizontal axis: 1/N; curves: Gaussian, box constraint)



There is no scaling theory for this problem. However, our numerical data suggest
that the extrapolation to $N \to \infty$ cannot be a linear one. A tentative quadratic
extrapolation yields $\alpha_c = 0.834$ for Gaussian distributed patterns, $\alpha_c = 0.832$ for the
box constraint and $\alpha_c = 0.827$ for ±1 patterns and $N$ even. In the case of ±1 patterns
and $N$ odd, a quadratic fit is clearly inadmissible.
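
A quadratic extrapolation of this kind amounts to fitting $\alpha_c(N) = a_0 + a_1/N + a_2/N^2$ and reading off $a_0$ as the value for $N \to \infty$. The sketch below is one possible way to do this with an ordinary least-squares fit. The fitting routine, its name and the data are assumptions for illustration: the data are synthetic, generated from a known quadratic, and are not the measured capacities.

```python
def quadratic_fit(xs, ys):
    """Least-squares coefficients (a0, a1, a2) of y = a0 + a1*x + a2*x**2."""
    # Normal equations A a = b with A[r][c] = sum x**(r+c), b[r] = sum y * x**r.
    A = [[sum(x ** (r + c) for x in xs) for c in range(3)] for r in range(3)]
    b = [sum(y * x ** r for x, y in zip(xs, ys)) for r in range(3)]
    # Gaussian elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            for c in range(col, 3):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    a = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        a[r] = (b[r] - sum(A[r][c] * a[c] for c in range(r + 1, 3))) / A[r][r]
    return tuple(a)

# Synthetic data: alpha_c(N) = 0.8 + 0.3/N - 0.5/N**2 for even N from 4 to 40.
sizes = list(range(4, 41, 2))
xs = [1.0 / n for n in sizes]
ys = [0.8 + 0.3 * x - 0.5 * x * x for x in xs]
a0, a1, a2 = quadratic_fit(xs, ys)  # a0 is the extrapolated value at 1/N -> 0
```

With real data the fit would be weighted by the statistical errors. The unweighted version is kept here for brevity.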

A second approach followed the procedure of Krauth and Opper [3] in determining
$\kappa_\alpha(N)$ for different values of $\alpha$ and subsequently extrapolating to $N \to \infty$. The
capacity for Gaussian distributed patterns is determined as $\alpha_c = 0.833$.

We aimed to examine the possibilities and limits of combinatorial optimization
when used for the training of autoassociative neural networks with binary couplings.
The branch-and-bound algorithm we developed allowed us to extend the exact
investigation to systems with up to 40 neurons. We were not able to leave the region of
strong finite-size effects but could confirm theoretical and numerical studies
with additional numerical evidence. The possibility of determining all solutions of the
learning problem also opens the way for an analysis of the space of solutions similar
to the one already done for the ground states of the Ising spin glass model [10].